
feat: AsyncPipeline that can schedule components to run concurrently #8812

Merged

merged 99 commits into main from feat/async_pipeline on Feb 7, 2025

Conversation

mathislucka
Member

@mathislucka mathislucka commented Feb 4, 2025

Related Issues

Proposed Changes:

Implements an AsyncPipeline that supports:

  • running pipelines asynchronously
  • step-by-step execution through an async generator
  • concurrent execution of components whenever possible (e.g. hybrid retrieval, multiple generators that can run in parallel)
  • sync run-method with concurrent execution of components
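The scheduling idea behind these points can be illustrated with plain asyncio, independent of Haystack (the retriever names and timings below are made up for illustration): two independent steps run concurrently, and a dependent step runs only once both of its inputs are ready, which is the hybrid-retrieval shape mentioned above.

```python
import asyncio
import time

async def bm25_retriever(query: str) -> list:
    # Simulate an I/O-bound sparse retrieval step.
    await asyncio.sleep(0.1)
    return [f"bm25:{query}"]

async def embedding_retriever(query: str) -> list:
    # Simulate an I/O-bound dense retrieval step.
    await asyncio.sleep(0.1)
    return [f"dense:{query}"]

async def hybrid_retrieval(query: str) -> list:
    # Independent components are scheduled concurrently ...
    sparse, dense = await asyncio.gather(
        bm25_retriever(query), embedding_retriever(query)
    )
    # ... and the joining component runs once both have finished.
    return sparse + dense

start = time.perf_counter()
docs = asyncio.run(hybrid_retrieval("test"))
elapsed = time.perf_counter() - start
print(docs)  # ['bm25:test', 'dense:test']
# elapsed is ~0.1s (the slower branch), not 0.2s sequential
```

Because both retrievers overlap, the total latency tracks the slowest branch rather than the sum of the branches.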

How did you test it?

  • unit tests
  • adapted behavioral tests to use Pipeline and AsyncPipeline

Notes for the reviewer

Review after #8707
Code was reviewed here before: deepset-ai/haystack-experimental#180

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@mathislucka mathislucka requested review from davidsbatista and Amnah199 and removed request for vblagoje February 6, 2025 15:34
@mathislucka
Member Author

@Amnah199 @davidsbatista much smaller diff now that the other PR is merged.

This is largely the same as the PR that we already merged to experimental with the following differences:

  • fixed bug where we didn't wait long enough for DEFER(_LAST)
  • added pipeline type to telemetry

@@ -23,6 +23,7 @@
"default_to_dict",
"DeserializationError",
"ComponentError",
"AsyncPipeline",
Contributor

(nit) suggestion: keeping these exports ordered alphabetically makes entries easier to locate as the list grows

__all__ = [
    "Answer",
    "AsyncPipeline",
    "ComponentError",
    "DeserializationError",
    "Document",
    "ExtractedAnswer",
    "GeneratedAnswer",
    "Pipeline",
    "PredefinedPipeline",
    "component",
    "default_from_dict",
    "default_to_dict",
]

@davidsbatista
Contributor

I did another quick review, although most of this was already reviewed before.

From my side it's approved, but to play it safe, let's wait for Amna to also do another quick review before merging.

Contributor

@davidsbatista davidsbatista left a comment

LGTM

Comment on lines 12 to 14
async_loop = asyncio.new_event_loop()
asyncio.set_event_loop(async_loop)

Contributor

Here as well, we can avoid manual handling of loops by using asyncio.run, if you feel that would be better.
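The suggestion can be sketched with a minimal, Haystack-free example (the `run_all` coroutine below is made up): `asyncio.run` creates, runs, and closes the event loop in one call, replacing the manual `new_event_loop` / `set_event_loop` / `close` sequence.

```python
import asyncio

async def run_all() -> list:
    # Stand-in for the pipeline runs gathered in the real test.
    async def work(n: int) -> int:
        await asyncio.sleep(0)
        return n * 2
    return await asyncio.gather(work(1), work(2))

# Manual handling (what the test currently does):
# loop = asyncio.new_event_loop()
# asyncio.set_event_loop(loop)
# try:
#     results = loop.run_until_complete(run_all())
# finally:
#     loop.close()

# Equivalent one-liner; the loop is created and closed for us:
results = asyncio.run(run_all())
print(results)  # [2, 4]
```

Note that `asyncio.run` cannot be called from within an already-running event loop, which is usually fine in tests.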

Comment on lines 6 to 26
def test_async_pipeline_reentrance(waiting_component, spying_tracer):
    pp = AsyncPipeline()
    pp.add_component("wait", waiting_component())

    run_data = [{"wait_for": 1}, {"wait_for": 2}]

    async_loop = asyncio.new_event_loop()
    asyncio.set_event_loop(async_loop)

    async def run_all():
        # Create concurrent tasks for each pipeline run
        tasks = [pp.run_async(data) for data in run_data]
        await asyncio.gather(*tasks)

    try:
        async_loop.run_until_complete(run_all())
        component_spans = [sp for sp in spying_tracer.spans if sp.operation_name == "haystack.component.run_async"]
        for span in component_spans:
            assert span.tags["haystack.component.visits"] == 1
    finally:
        async_loop.close()
Contributor

@Amnah199 Amnah199 Feb 7, 2025

Something like this? (Although I didn't test it.)

Suggested change
def test_async_pipeline_reentrance(waiting_component, spying_tracer):
    pp = AsyncPipeline()
    pp.add_component("wait", waiting_component())

    run_data = [{"wait_for": 1}, {"wait_for": 2}]

    async_loop = asyncio.new_event_loop()
    asyncio.set_event_loop(async_loop)

    async def run_all():
        # Create concurrent tasks for each pipeline run
        tasks = [pp.run_async(data) for data in run_data]
        await asyncio.gather(*tasks)

    try:
        async_loop.run_until_complete(run_all())
        component_spans = [sp for sp in spying_tracer.spans if sp.operation_name == "haystack.component.run_async"]
        for span in component_spans:
            assert span.tags["haystack.component.visits"] == 1
    finally:
        async_loop.close()
def test_async_pipeline_reentrance(waiting_component, spying_tracer):
    """
    Test that the AsyncPipeline can execute multiple runs concurrently and that
    each component is called exactly once per run (as indicated by the 'visits' tag).
    """
    async_pipeline = AsyncPipeline()
    async_pipeline.add_component("wait", waiting_component())

    run_data = [{"wait_for": 1}, {"wait_for": 2}]

    async def run_all():
        tasks = [async_pipeline.run_async(data) for data in run_data]
        await asyncio.gather(*tasks)

    # Use asyncio.run to manage the event loop.
    asyncio.run(run_all())

    component_spans = [
        sp for sp in spying_tracer.spans
        if sp.operation_name == "haystack.component.run_async"
    ]
    for span in component_spans:
        expected_visits = 1
        actual_visits = span.tags.get("haystack.component.visits")
        assert actual_visits == expected_visits, (
            f"Expected {expected_visits} visit, got {actual_visits} for span {span}"
        )

Contributor

@Amnah199 Amnah199 left a comment

LGTM! Thanks again @mathislucka.
Much appreciated!

@mathislucka mathislucka merged commit e5b9bde into main Feb 7, 2025
18 checks passed
@mathislucka mathislucka deleted the feat/async_pipeline branch February 7, 2025 15:37
@alex-stoica

Excellent work @mathislucka 🚀! I've been waiting a long time for this feature! I hope the documentation will be updated soon as well.
I've personally tested the implementation with code slightly adapted from deepset-ai/haystack-experimental#144, and it works as expected: I see both pipeline-level and component-level concurrency.

import asyncio
from haystack import AsyncPipeline
from haystack import component
from datetime import datetime

def print_with_prefix(pipeline_prefix: str, component_name: str, message: str):
    """Prints a message prefixed with the pipeline and component name."""
    print(f"[{pipeline_prefix}] {component_name} {message} at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

async def async_sleep_task(duration: int):
    await asyncio.sleep(duration)

@component
class ComponentA:
    @component.output_types(A_output=str)
    def run(self, dummy: str) -> dict:
        prefix = getattr(self, 'pipeline_name', 'Unknown')
        print_with_prefix(prefix, "ComponentA", "run started")
        result = {"A_output": f"Processed by A: {dummy}"}
        print_with_prefix(prefix, "ComponentA", "run ended")
        return result

    @component.output_types(A_output=str)
    async def run_async(self, dummy: str) -> dict:
        prefix = getattr(self, 'pipeline_name', 'Unknown')
        print_with_prefix(prefix, "ComponentA", "run_async started")
        await async_sleep_task(3)
        result = {"A_output": f"Processed by A: {dummy}"}
        print_with_prefix(prefix, "ComponentA", "run_async ended")
        return result

@component
class ComponentB:
    @component.output_types(B_output=str)
    def run(self, dummy: str) -> dict:
        prefix = getattr(self, 'pipeline_name', 'Unknown')
        print_with_prefix(prefix, "ComponentB", "run started")
        result = {"B_output": f"Processed by B: {dummy}"}
        print_with_prefix(prefix, "ComponentB", "run ended")
        return result

    @component.output_types(B_output=str)
    async def run_async(self, dummy: str) -> dict:
        prefix = getattr(self, 'pipeline_name', 'Unknown')
        print_with_prefix(prefix, "ComponentB", "run_async started")
        await async_sleep_task(2)
        result = {"B_output": f"Processed by B: {dummy}"}
        print_with_prefix(prefix, "ComponentB", "run_async ended")
        return result

@component
class ComponentC:
    @component.output_types(C_output=str)
    def run(self, A_output: str, B_output: str) -> dict:
        prefix = getattr(self, 'pipeline_name', 'Unknown')
        print_with_prefix(prefix, "ComponentC", "run started")
        result = {"C_output": f"C combined outputs: {A_output}, {B_output}"}
        print_with_prefix(prefix, "ComponentC", "run ended")
        return result

    @component.output_types(C_output=str)
    async def run_async(self, A_output: str, B_output: str) -> dict:
        prefix = getattr(self, 'pipeline_name', 'Unknown')
        print_with_prefix(prefix, "ComponentC", "run_async started")
        await async_sleep_task(1)
        result = {"C_output": f"C combined outputs: {A_output}, {B_output}"}
        print_with_prefix(prefix, "ComponentC", "run_async ended")
        return result

def create_pipeline(name: str):
    pipeline = AsyncPipeline()
    pipeline.name = name
    
    comp_a = ComponentA()
    comp_b = ComponentB()
    comp_c = ComponentC()

    comp_a.pipeline_name = name
    comp_b.pipeline_name = name
    comp_c.pipeline_name = name

    pipeline.add_component("A", comp_a)
    pipeline.add_component("B", comp_b)
    pipeline.add_component("C", comp_c)
    pipeline.connect("A.A_output", "C.A_output")
    pipeline.connect("B.B_output", "C.B_output")
    return pipeline

if __name__ == "__main__":
    async def run_pipeline(pipeline, input_data):
        output = await pipeline.run_async(input_data)
        print(f"[{pipeline.name}] Pipeline output: {output}")

    async def main():
        input_data1 = {"dummy": "Test data 1"}
        input_data2 = {"dummy": "Test data 2"}
        pipeline1 = create_pipeline("P1")
        pipeline2 = create_pipeline("P2")
        # Run both pipelines concurrently.
        task1 = asyncio.create_task(run_pipeline(pipeline1, input_data1))
        task2 = asyncio.create_task(run_pipeline(pipeline2, input_data2))
        await asyncio.gather(task1, task2)

    asyncio.run(main())

blue-gitty pushed a commit to blue-gitty/haystack that referenced this pull request Feb 9, 2025
…eepset-ai#8812)

* add component checks

* pipeline should run deterministically

* add FIFOQueue

* add agent tests

* add order dependent tests

* run new tests

* remove code that is not needed

* test: intermediate from cycle outputs are available outside cycle

* add tests for component checks (Claude)

* adapt tests for component checks (o1 review)

* chore: format

* remove tests that aren't needed anymore

* add _calculate_priority tests

* revert accidental change in pyproject.toml

* test format conversion

* adapt to naming convention

* chore: proper docstrings and type hints for PQ

* format

* add more unit tests

* rm unneeded comments

* test input consumption

* lint

* fix: docstrings

* lint

* format

* format

* fix license header

* fix license header

* add component run tests

* fix: pass correct input format to tracing

* fix types

* format

* format

* types

* add defaults from Socket instead of signature

- otherwise components with dynamic inputs would fail

* fix test names

* still wait for optional inputs on greedy variadic sockets

- mirrors previous behavior

* fix format

* wip: warn for ambiguous running order

* wip: alternative warning

* fix license header

* make code more readable

Co-authored-by: Amna Mubashar <[email protected]>

* Introduce content tracing to a behavioral test

* Fixing linting

* Remove debug print statements

* Fix tracer tests

* remove print

* test: test for component inputs

* test: remove testing for run order

* chore: update component checks from experimental

* chore: update pipeline and base from experimental

* refactor: remove unused method

* refactor: remove unused method

* refactor: outdated comment

* refactor: inputs state is updated as side effect

- to prepare for AsyncPipeline implementation

* format

* test: add file conversion test

* format

* fix: original implementation deepcopies outputs

* lint

* fix: from_dict was updated

* fix: format

* fix: test

* test: add test for thread safety

* remove unused imports

* format

* test: FIFOPriorityQueue

* chore: add release note

* feat: add AsyncPipeline

* chore: Add release notes

* fix: format

* debug: switch run order to debug ubuntu and windows tests

* fix: consider priorities of other components while waiting for DEFER

* refactor: simplify code

* fix: resolve merge conflict with mermaid changes

* fix: format

* fix: remove unused import

* refactor: rename to avoid accidental conflicts

* fix: track pipeline type

* fix: and extend test

* fix: format

* style: sort alphabetically

* Update test/core/pipeline/features/conftest.py

Co-authored-by: Amna Mubashar <[email protected]>

* Update test/core/pipeline/features/conftest.py

Co-authored-by: Amna Mubashar <[email protected]>

* Update releasenotes/notes/feat-async-pipeline-338856a142e1318c.yaml

* fix: indentation, do not close loop

* fix: use asyncio.run

* fix: format

---------

Co-authored-by: Amna Mubashar <[email protected]>
Co-authored-by: David S. Batista <[email protected]>
@mathislucka
Member Author

Thanks @alex-stoica! Yes, we will have updated documentation for the 2.10 release. I'm glad you find it useful and that it works as expected.

Since this is quite a complex feature, please let me know if you find anything that doesn't work as you would expect.

Development

Successfully merging this pull request may close these issues.

components should run concurrently when not explicitly waiting on inputs
5 participants